Optimisation concerns search for a maximum (or minimum) of a function. An optimisation problem can be expressed quite compactly as follows: \[ \max_{x\in\mathcal{D}} f(x) \quad\text{or}\quad \underset{x\in\mathcal{D}}{\operatorname{argmax}}f(x),\] where \(\mathcal{D}\) is a subset of the function’s domain, over which we try to find a maximiser. The former is the (global) maximum, while the latter is a set of (global) maximisers. In other words, a maximiser \(x^*\in\operatorname{argmax}f(x)\) satisfies \[ f(x)\leq f(x^*),\; \forall x\in\mathcal{D}. \]
In these notes, we discuss only maximisation problems because any minimisation problem can be readily transformed into a maximisation problem as follows: \[ \underset{x\in\mathcal{D}}{\operatorname{argmin}} f(x) = \underset{x\in\mathcal{D}}{\operatorname{argmax}} -f(x).\]
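This identity is easy to check numerically. A minimal sketch using base R's `optimize` (the quadratic here is my own example, not one of the test functions below):

```r
# Minimising f is the same as maximising -f; both locate the same point.
f <- function(x) (x - 1)^2                                            # minimiser at x = 1
m.min <- optimize(f, interval=c(-5, 5))$minimum                       # argmin of f
m.max <- optimize(function(x) -f(x), c(-5, 5), maximum=TRUE)$maximum  # argmax of -f
c(m.min, m.max)  # both close to 1
```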
Global maximisers are generally difficult to find because our knowledge of \(f\) is only local. In other words, we can evaluate \(f\) at only a finite number of points and obtain a partial picture of the overall shape of \(y=f(x)\). In one-dimensional cases \((\mathcal{D}\subset\mathbb{R})\), partial pictures may be sufficient for locating global maxima. For example, it is easy to evaluate \[ f_1(x)=\sin(3\pi x)/x+x^3-5x^2+6x \] at 301 points, interpolate them, and obtain the following graph.
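In code, this evaluation takes a couple of lines; the grid maximum lands near the values quoted below (the plot call is commented out, as it merely reproduces the graph):

```r
# Evaluate f1 on a 301-point grid over [0.3, 3.5] and locate the grid maximum.
x <- seq(0.3, 3.5, length.out=301)
y <- sin(3*pi*x)/x + x^3 - 5*x^2 + 6*x
c(x[which.max(y)], max(y))  # roughly (0.8, 3.3)
# plot(x, y, type="l")      # interpolating the evaluations gives the graph
```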
With this graph, it is straightforward to find the (approximate) maximum of \(f_1(0.8)=3.3\) over the interval \([0.3, 3.5]\). This is easy because computation of the objective function is cheap, and 301 evaluations are instantaneous. However, some functions in practice are very expensive to compute, taking hours per evaluation (e.g., large-scale simulations). Moreover, modern optimisation problems often involve complex objective functions defined on high-dimensional spaces (e.g., deep neural networks). In such cases, exhaustive search is infeasible and we have to be content with local maxima. In these notes, we focus on techniques for finding local maxima.
Recall that a point \(x^*\) is a local maximiser if there exists a neighbourhood \(\mathcal{N}\subset\mathcal{D}\) of \(x^*\) such that \(f(x)\leq f(x^*)\) for all \(x\in\mathcal{N}\).
That said, there are various techniques for global optimisation, one of which is Bayesian optimisation.
We use two maximisation problems to test our algorithms. The first one: \[ \max_{x\in\mathcal{D_1}} f_1(x) = \max_{x\in\mathcal{D_1}} \frac{\sin(3\pi x)}{x}+x^3-5x^2+6x , \] where \(\mathcal{D_1} = [0.3, 3.5]\). Here is an R function that returns \(f_1\), \(f_1'\), and \(f_1''\).
fgh1 <- function(x) {
f <- sin(3*pi*x)/x+x^3-5*x^2+6*x
grad <- (3*pi*x*cos(3*pi*x) - sin(3*pi*x))/x^2 + 3*x^2 - 10*x + 6
H <- -(9*pi^2*x^2*sin(3*pi*x) + 2*(3*pi*x*cos(3*pi*x) - sin(3*pi*x)))/x^3 + 6*x - 10
return(c(f,grad,H))
}
The second one: \[ \max_{(x_1,x_2)\in\mathcal{D_2}} f_2(x_1,x_2) = \max_{(x_1,x_2)\in\mathcal{D_2}} \sin\left(\frac{x_1^2}{2} - \frac{x_2^2}{4}\right)\cos\left(2{x_1}-e^{x_2}\right) , \] where \(\mathcal{D_2} = [-0.5, 3]\times[-0.5, 2]\). \(y = f_2(x)\) looks like the following.
f2 <- function(x, y) sin(x^2/2 - y^2/4)*cos(2*x-exp(y))
XY <- meshgrid(seq(-.5, 3, length.out=100), seq(-.5, 2, length.out=100))
Z <- f2(XY$X, XY$Y)
plot_ly(x=XY$X, y=XY$Y, z=Z, showscale=F) %>%
layout(scene=list(xaxis=list(title="x1"),
yaxis=list(title="x2"),
zaxis=list(title="y"))) %>%
add_surface()
Here is an R function that returns \(f_2\), \(\nabla f_2\), and the Hessian of \(f_2\).
fgh2 <- function(x) {
f <- sin(x[1]^2/2-x[2]^2/4)*cos(2*x[1]-exp(x[2]))
g1 <- x[1]*cos(2*x[1]-exp(x[2]))*cos(x[1]^2/2-x[2]^2/4) -
2*sin(2*x[1]-exp(x[2]))*sin(x[1]^2/2-x[2]^2/4)
g2 <- -(x[2]*cos(2*x[1]-exp(x[2]))*cos(x[1]^2/2-x[2]^2/4))/2 +
exp(x[2])*sin(2*x[1]-exp(x[2]))*sin(x[1]^2/2-x[2]^2/4)
h11 <- cos(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) -
x[1]^2*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1]) +
4*x[1]*sin(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) -
4*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1])
h12 <- -x[1]*exp(x[2])*sin(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) -
x[2]*sin(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) +
2*exp(x[2])*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1]) +
x[1]*x[2]*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1])/2
h22 <- -exp(x[2])*sin(exp(x[2])-2*x[1])*sin(x[1]^2/2-x[2]^2/4) -
cos(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4)/2 -
x[2]^2*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1])/4 +
exp(x[2])*x[2]*sin(exp(x[2])-2*x[1])*cos(x[1]^2/2-x[2]^2/4) -
exp(2*x[2])*sin(x[1]^2/2-x[2]^2/4)*cos(exp(x[2])-2*x[1])
return(list(f, c(g1,g2), matrix(c(h11,h12,h12,h22),2,2)))
}
This is nasty, but we need this information.
The golden-section method is based on the same “bracketing” idea used in the bisection method for root finding and does not require derivative information. Here, we successively isolate smaller intervals (brackets) that contain a local maximum. Whereas root finding brackets a root with two points, isolating a local maximum requires three points \(x_l\), \(x_m\), and \(x_r\) with \(x_l < x_m < x_r\). The key observation is that \[ f(x_l) < f(x_m) \quad\text{and}\quad f(x_r) < f(x_m) \] implies the existence of a local maximiser in the interval \([x_l,x_r]\).
As a bracketing method, it successively shrinks the interval while updating \(x_m\) as the estimate of a local maximiser, until the stopping criterion is met.
The idea at step #3 is that we want to shrink the bracket while keeping the estimate of the maximiser inside it. If \(f(y)<f(x_m)\), we cannot update the estimate but still want to shrink the bracket.
At step #2 above, the golden ratio arises from choosing \(y\) so that the ratio of the larger subinterval’s length to the smaller’s stays the same across iterations. As an example, suppose we find ourselves in the following situation.
curve(-x^2, xaxt="n", yaxt="n", ann=FALSE, from=-1, to=1)
letters <- c(expression(x[l]),expression(x[m]),"y",expression(x[r]))
from_x <- c(-.9,-.1,.25,.9)
axis(1, from_x, labels=letters)
from_y <- rep(-1.2, 4)
to_x <- from_x
to_y <- sapply(from_x, function(x) return(-x^2))
segments(from_x, from_y, to_x, to_y, lty=3)
arrows(from_x[1], -.9, to_x[2], -.9, length=.1, code=3)
text((to_x[1]+to_x[2])/2, -.9, "a", pos=3)
arrows(from_x[2], -.9, to_x[4], -.9, length=.1, code=3)
text((to_x[2]+to_x[4])/2, -.9, "b", pos=3)
arrows(from_x[2], -.6, to_x[3], -.6, length=.1, code=3)
text((to_x[2]+to_x[3])/2, -.6, "c", pos=3)
points(to_x[2:3], to_y[2:3])
text(to_x[2], to_y[2]+.01, expression(f(x[m])), pos=2)
text(to_x[3], to_y[3]+.01, expression(f(y)), pos=4)
title("Illustration of the golden ratio")
As you can see, the right subinterval is larger, so \(I = (x_m,x_r)\). Then we will pick \(y\in I\) (equivalently, pick \(c\)) such that \(\frac{a}{c} = \frac{b}{a}\) when \(f(y)<f(x_m)\). If instead the situation is as below, where we get a better estimate of the maximum, then we will pick \(y\) (or \(c\)) such that \(\frac{b-c}{c} = \frac{b}{a}\) since \(f(y) > f(x_m)\).
curve(-x^2, xaxt="n", yaxt="n", ann=FALSE, from=-1, to=1)
letters <- c(expression(x[l]),expression(x[m]),"y",expression(x[r]))
from_x <- c(-.9,-.2,.1,.9)
axis(1, from_x, labels=letters)
from_y <- rep(-1.2, 4)
to_x <- from_x
to_y <- sapply(from_x, function(x) return(-x^2))
segments(from_x, from_y, to_x, to_y, lty=3)
arrows(from_x[1], -.9, to_x[2], -.9, length=.1, code=3)
text((to_x[1]+to_x[2])/2, -.9, "a", pos=3)
arrows(from_x[2], -.9, to_x[4], -.9, length=.1, code=3)
text((to_x[2]+to_x[4])/2, -.9, "b", pos=3)
arrows(from_x[2], -.6, to_x[3], -.6, length=.1, code=3)
text((to_x[2]+to_x[3])/2, -.6, "c", pos=3)
points(to_x[2:3], to_y[2:3])
text(to_x[2], to_y[2]+.01, expression(f(x[m])), pos=2)
text(to_x[3], to_y[3]+.01, expression(f(y)), pos=4)
title("Illustration of the golden ratio")
So, given \(I = (x_m,x_r)\), two situations can arise from our pick of \(y\), and we do not know which, because \(f(y)\) is unknown until we pick \(y\) and evaluate it. We therefore want the ratio to be preserved no matter which scenario occurs. Formally, pick \(y=x_m+c\) such that \[ \frac{a}{c} = \frac{b}{a} \quad \text{and} \quad \frac{b-c}{c} = \frac{b}{a} .\] Or, simply, \[c = b - a.\]
Plugging this \(c\) into either equation, we get \[\begin{gather} \frac{b}{a} = \frac{a}{b-a}\\ \Leftrightarrow\quad \left(\frac{b}{a}\right)^2-\left(\frac{b}{a}\right)-1 = 0 \end{gather}\] The positive root is the golden ratio \[\frac{b}{a} = \frac{1+\sqrt{5}}{2}.\]
In short, given three points \(x_l\), \(x_m\), and \(x_r\), we choose \[\begin{align} y &= x_m + c\\ &= x_m + \frac{a^2}{b}\\ &= x_m + \frac{(x_m-x_l)}{(1+\sqrt{5})/2}. \end{align}\]
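A quick numeric sanity check of these relations (my own illustration, taking \(a=1\)):

```r
# With b/a equal to the golden ratio and c = b - a, both ratio conditions hold.
phi <- (1 + sqrt(5))/2
a <- 1; b <- phi; c <- b - a
c(a/c, (b - c)/c, b/a)  # all three equal phi
```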
Try to see what will happen if the left subinterval is larger and \(I = (x_l,x_m)\).
Here is an implementation of the algorithm.
gsection <- function(f, x.l, x.r, x.m, tol=1e-8) {
gr <- (1 + sqrt(5))/2
f.l <- f(x.l)[1]
f.r <- f(x.r)[1]
f.m <- f(x.m)[1]
# Main
n <- 0
while ((x.r - x.l) > tol) {
n <- n + 1
if ((x.r - x.m) > (x.m - x.l)) {
y <- x.m + (x.m - x.l)/gr
f.y <- f(y)[1]
if (f.y >= f.m) {
x.l <- x.m
f.l <- f.m
x.m <- y
f.m <- f.y
} else {
x.r <- y
f.r <- f.y
}
} else {
y <- x.m - (x.r - x.m)/gr
f.y <- f(y)[1]
if (f.y >= f.m) {
x.r <- x.m
f.r <- f.m
x.m <- y
f.m <- f.y
} else {
x.l <- y
f.l <- f.y
}
}
}
return(list("x"=x.m, "f"=f(x.m)[1], "n"=n))
}
Now, let’s apply it to \(f_1(x)=\sin(3\pi x)/x+x^3-5x^2+6x\) with initial \((x_l,x_r,x_m)=(1.1,2.1,1.6)\).
gsection(fgh1, 1.1, 2.1, 1.6)
## $x
## [1] 1.455602
##
## $f
## [1] 1.851551
##
## $n
## [1] 39
Try other initial triples and see which local maxima you find.
Thanks to the popularity of neural networks, gradient descent has been studied extensively as a class of optimisation methods.
Since we are doing maximisation, gradient ascent may be a more appropriate title.
Recall that the gradient \(\nabla f\) is the vector of partial derivatives of \(f\), containing information about the direction of fastest increase in \(f\) (as well as the rate of increase). When searching for a local maximum, it is almost natural, albeit naive, to follow the direction of the gradient at each iteration. Hence the update rule: \[ x_{n+1} = x_n + \alpha \nabla f(x_n).\]
\(\alpha\) is a step size that determines how far we move in the direction of \(\nabla f(x_n)\). Remember that, unless \(f\) is linear, \(\nabla f\) is a function of \(x\), and \(\nabla f(x_n)\) no longer gives the steepest direction once we move away from \(x_n\). So, as a rule of thumb, we should take a small step at a time.
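Before the test problems, here is a minimal sketch of the update rule on a toy concave function (my own example), \(f(x) = -(x-2)^2\) with \(\nabla f(x) = -2(x-2)\):

```r
# Gradient ascent on f(x) = -(x - 2)^2; the iterate climbs to the maximiser x* = 2.
grad <- function(x) -2*(x - 2)
x <- 0        # initial point
alpha <- 0.1  # step size
for (n in 1:200) x <- x + alpha*grad(x)
x  # close to 2
```

Each step multiplies the error \(x_n - 2\) by \(1 - 2\alpha = 0.8\), so the iterate contracts geometrically towards 2.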
Let’s first apply it to the 1-D problem.
gd1 <- function(fg, x0, alpha=1e-2, tol=1e-6, n.max=1e5) {
x <- x0
fg.x <- fg(x)
n <- 0
while ((abs(fg.x[2]) > tol) & (n < n.max)) {
x <- x + alpha*fg.x[2]
fg.x <- fg(x)
n <- n + 1
}
return(list("x"=x, "f"=fg(x)[1], "n"=n))
}
Note the stopping criteria. Besides the maximum number of iterations, it stops when \(|f'(x)|\leq\epsilon\), since \(f'(x)=0\) at a local maximum (and, with a fixed \(\alpha\), the actual step taken is \(\alpha|f'(x)|\)).
Try other stopping criteria such as \(|f(x_{n+1})-f(x_n)|\leq\epsilon\) and \(\|x_{n+1}-x_n\|\leq\epsilon\).
gd1(fgh1, 1.3)
## $x
## [1] 1.455602
##
## $f
## [1] 1.851551
##
## $n
## [1] 19
Identify all local maxima using different initial points. You can eyeball the graph and determine points from which to hill-climb. What will happen if you start at \(x\leq0.4\) or \(x\geq3.1\)?
Next, the 2-D problem using \(x_0=(1.6, -0.3)\).
gd2 <- function(fg, x0, alpha=1e-2, tol=1e-6, n.max=1e5) {
x <- x0
grad <- fg(x)[[2]]
n <- 0
while ((max(abs(grad)) > tol) & (n < n.max)) {
x <- x + alpha*grad
grad <- fg(x)[[2]]
n <- n + 1
}
return(list("x"=x, "f"=fg(x)[[1]], "n"=n))
}
gd2(fgh2, c(1.6, -.3))
## $x
## [1] 2.030697 1.401526
##
## $f
## [1] 1
##
## $n
## [1] 1102
But if we use \(x_0=(-0.2, 0.3)\),
One-dimensional problems are relatively easy. We need not worry too much about the step size: in the end, there are only two directions to move, left or right. So, as long as the step size is not excessively large, we will just march on and get there. However, it may not be obvious how large is excessively large. To illustrate, consider the following situation, where \(f'(x) = -4x^3+33x^2-76x+40\), which implies \(f'(x)=0\) at approximately \(x\in\{0.75,2.93,4.57\}\).
Suppose \(x_n=0\), where \(f'(0)=40\). Thus \(x_{n+1} = 40\alpha\), implying that if we use \(\alpha>0.074\), we will have \(x_{n+1}>2.93\), miss the first local maximum, and climb towards the second local maximum \(x^*=4.57\).
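The arithmetic checks out directly (the threshold comes from \(2.93/40 \approx 0.073\)):

```r
# f'(x) = -4x^3 + 33x^2 - 76x + 40; starting at x_n = 0 with step size alpha,
# x_{n+1} = alpha*f'(0) = 40*alpha jumps past the root 2.93 once alpha > 0.0733.
fp <- function(x) -4*x^3 + 33*x^2 - 76*x + 40
fp(0)            # 40
0.074 * fp(0)    # 2.96 > 2.93: overshoots the first local maximum
0.050 * fp(0)    # 2.00 < 2.93: stays in the first basin
```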
The issue is much more difficult to handle in higher dimensions, where there are infinitely many directions to choose from, and we could wander for a long time before approaching any local maximum. Of course, if we always use a very small \(\alpha\), progress might be too slow to be practical.
One solution to step-size selection is to choose: \[ \alpha^* \in \underset{\alpha\geq 0}{\operatorname{argmax}} g(\alpha) = \underset{\alpha\geq 0}{\operatorname{argmax}} f(x_n + \alpha \nabla f(x_n)) . \] This is another maximisation problem. But it is a 1-D problem and relatively easy to solve.
Remember that this maximisation problem with respect to \(\alpha\) is solved at each iteration within the original maximisation problem.
Here, we try to find a maximiser \(\alpha^*\) using brute-force search, which is naively simple but often effective in 1-D problems. Specifically, generate 1,000 values of \(\alpha\) equally spaced between \(0\) and \(0.1\), evaluate \(g\) at each, and return the \(\alpha^*\) whose \(g(\alpha^*)\) is the largest among the 1,000 candidates.
Brute-force optimisation is out of the question in high-dimensional problems. But even in 1-D problems it may not be feasible if the search space is large or a fine grid is required (e.g., when \(\alpha=10^{-5}\) and \(\alpha=10^{-6}\) make a significant difference).
First, the 1-D problem.
line.search1 <- function(fg, x0, tol=1e-6, n.max=1e5) {
g <- function(alpha, x, grad) fg(x + alpha*grad)[1]
x <- x0
grad <- fg(x)[2]
# Brute-force search in alpha
Alphas <- seq(0, .1, length.out=1e3)
Gs <- sapply(Alphas, g, x=x, grad=grad)
alpha <- Alphas[which.max(Gs)]
n <- 0
while ((abs(grad) > tol) & (n < n.max)) {
x <- x + alpha*grad
grad <- fg(x)[2]
# Brute-force search in alpha
Alphas <- seq(0, .1, length.out=1e3)
Gs <- sapply(Alphas, g, x=x, grad=grad)
alpha <- Alphas[which.max(Gs)]
n <- n + 1
}
return(list("x"=x, "f"=fg(x)[[1]], "n"=n))
}
line.search1(fgh1, 1.3)
## $x
## [1] 1.455602
##
## $f
## [1] 1.851551
##
## $n
## [1] 3
Next, the 2-D problem.
line.search2 <- function(fg, x0, tol=1e-6, n.max=1e5) {
g <- function(alpha, x, grad) fg(x+alpha*grad)[[1]]
x <- x0
grad <- fg(x)[[2]]
# Brute-force search in alpha
Alphas <- seq(0, .1, length.out=1e3)
Gs <- sapply(Alphas, g, x=x, grad=grad)
alpha <- Alphas[which.max(Gs)]
n <- 0
while ((max(abs(alpha*grad)) > tol) & (n < n.max)) {
x <- x + alpha*grad
grad <- fg(x)[[2]]
# Brute-force search in alpha
Alphas <- seq(0, .1, length.out=1e3)
Gs <- sapply(Alphas, g, x=x, grad=grad)
alpha <- Alphas[which.max(Gs)]
n <- n + 1
}
return(list("x"=x, "f"=fg(x)[[1]], "n"=n))
}
line.search2(fgh2, c(1.6, -.3))
## $x
## [1] 2.030693 1.401524
##
## $f
## [1] 1
##
## $n
## [1] 96
Setting aside the cost of the inner optimisation, we see a significant reduction in the number of steps taken to reach local maxima.
In 1-D problems, Newton’s method is simply to apply the Newton-Raphson method to find a root of \(f'\). This makes sense because \(f'(x)=0\) is a necessary condition for a local maximum (with a catch of being merely necessary). Therefore, assuming that \(f\) is twice continuously differentiable, we have: \[ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} .\]
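As a sketch, we can iterate this update on \(f_1\) using fgh1 from above (restated so the snippet is self-contained). The starting point \(x_0=1.4\) is my own choice, deliberately close to the maximiser near 1.456; starting further away can send the iteration elsewhere, as discussed later.

```r
fgh1 <- function(x) {
  f <- sin(3*pi*x)/x + x^3 - 5*x^2 + 6*x
  grad <- (3*pi*x*cos(3*pi*x) - sin(3*pi*x))/x^2 + 3*x^2 - 10*x + 6
  H <- -(9*pi^2*x^2*sin(3*pi*x) + 2*(3*pi*x*cos(3*pi*x) - sin(3*pi*x)))/x^3 + 6*x - 10
  c(f, grad, H)
}
# Newton's method in 1-D: x <- x - f'(x)/f''(x)
x <- 1.4
for (n in 1:20) x <- x - fgh1(x)[2]/fgh1(x)[3]
c(x, fgh1(x)[2])  # x near 1.4556, with f'(x) essentially zero
```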
For some reason, when the Newton-Raphson method is applied for optimisation, it is just called Newton’s method. Poor Joseph Raphson…
To generalize it to multi-dimensional problems, let’s first recall the key idea used in the Newton-Raphson method. When iteratively searching for a root of \(g\), the method uses a linear approximation of \(g\) at the current guess \(x_n\), since it is easy to find the root of a linear function. Then, the approximated root becomes our next guess \(x_{n+1}\).
Now, imagine a vector-valued function with \(k\) inputs and \(k\) outputs: \(g:\mathbb{R}^k\to\mathbb{R}^k\). In this case, the equivalents of the root and the derivative are, respectively, the zero vector and the Jacobian matrix \(J\). Doing the same approximation at \(x_n\in\mathbb{R}^k\), we get: \[ g(x) \approx g(x_n) + J(x_n)(x-x_n) .\] Setting the left-hand side equal to \(0\) and rearranging, we derive: \[ x_{n+1} = x_n - J(x_n)^{-1}g(x_n) .\]
Assuming \(J\) has a non-zero determinant at \(x_n\).
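A small root-finding sketch of this vector update. The system \(g\) below is my own example, a circle and a line, whose root in the positive quadrant is \((1/\sqrt{2}, 1/\sqrt{2})\):

```r
# Newton-Raphson for g: R^2 -> R^2, iterating x <- x - solve(J(x), g(x))
g <- function(x) c(x[1]^2 + x[2]^2 - 1, x[1] - x[2])
J <- function(x) matrix(c(2*x[1], 1,      # column 1: derivatives w.r.t. x1
                          2*x[2], -1),    # column 2: derivatives w.r.t. x2
                        2, 2)
x <- c(2, 0.5)
for (n in 1:20) x <- x - solve(J(x), g(x))
x  # converges to c(1, 1)/sqrt(2)
```

Note that `solve(J, g)` solves the linear system \(Jd = g\) directly rather than forming \(J^{-1}\), which is the standard numerical practice.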
Based on this, if we assume the objective function \(f\) is twice continuously differentiable, then we have multi-dimensional Newton’s method: \[ x_{n+1} = x_n - H(x_n)^{-1}\nabla f(x_n) ,\] where \(H\) is the Hessian matrix of \(f\).
To see the correspondence, remember that for \(f:\mathbb{R}^k\to\mathbb{R}\), \(\nabla f\) is a vector-valued function and its Jacobian is the Hessian of \(f\).
Another derivation maximises the second-order Taylor approximation of \(f\) around \(x_n\): \[ f(x) \approx f(x_n) + \nabla f(x_n)^\top (x-x_n) + \tfrac{1}{2}(x-x_n)^\top H(x_n)(x-x_n) ,\] and setting the gradient of the right-hand side to zero yields the same update.
Let’s apply Newton’s method to the 2-D testbed.
newton <- function(fgh, x0, tol=1e-6, n.max = 100) {
x <- x0
grad <- fgh(x)[[2]]
H <- fgh(x)[[3]]
n <- 0
while ((max(abs(grad)) > tol) & (n < n.max)) {
x <- x - solve(H, grad)
grad <- fgh(x)[[2]]
H <- fgh(x)[[3]]
n <- n + 1
}
return(list("x"=x, "f"=fgh(x)[[1]], "n"=n))
}
newton(fgh2, c(1.6, -.3))
## $x
## [1] 0.02637089 5.30394570
##
## $f
## [1] -0.681151
##
## $n
## [1] 9
What happened? When experimenting with various \(x_0\) on the test function, you will see that Newton’s method is extremely sensitive to the choice of \(x_0\) and not very reliable. Since the method is based on root finding of \(\nabla f\), it may converge to wherever the gradient vanishes: minima, maxima, or saddle points.
When Newton’s method works, its convergence is quick, often using only a fraction of the steps taken by line search. For example, it works for \[ f(x,y) = x(x-1)(x+1) - y(y-1)(y+1) \] plotted below. The price for speed is the requirement of second derivatives, which may be infeasible to compute in complex problems.
f3 <- function(x, y) x*(x-1)*(x+1) - y*(y-1)*(y+1)
XY = meshgrid(seq(-1, 1, length.out=100), seq(-1, 1, length.out=100))
Z <- f3(XY$X, XY$Y)
plot_ly(x=XY$X, y=XY$Y, z=Z, showscale=F) %>% add_surface()
Computation of the gradient is often a non-trivial task in modern optimisation problems. Automatic differentiation is a class of algorithms for scalable and accurate gradient computation. The intuition is that we can break a function into primitive operations, each of which has an easy derivative, and stitch these together. For example, the above test function \[ f_2(x_1,x_2) = \sin\left(\frac{x_1^2}{2} - \frac{x_2^2}{4}\right)\cos\left(2{x_1}-e^{x_2}\right) \] can be represented by the following computational graph.
Notice that it is easy to compute the component partial derivatives: \[ \frac{\partial v_8}{\partial x_2} = e^{x_2},\quad \frac{\partial v_6}{\partial v_5} = \cos v_5,\quad \frac{\partial f}{\partial v_{10}} = v_6,\ \dots \] Now, we can use the chain rule to find \(\frac{\partial f}{\partial x_{1}}\) and \(\frac{\partial f}{\partial x_{2}}\) in terms of these components: \[\begin{align*} \frac{\partial f}{\partial x_{1}} &= \frac{\partial f}{\partial v_6}\frac{\partial v_6}{\partial x_1} + \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial x_1}\\ &= \frac{\partial f}{\partial v_6}\frac{\partial v_6}{\partial v_5}\frac{\partial v_5}{\partial x_1} + \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial v_9}\frac{\partial v_9}{\partial x_1}\\ &= \frac{\partial f}{\partial v_6}\frac{\partial v_6}{\partial v_5}\frac{\partial v_5}{\partial v_3}\frac{\partial v_3}{\partial x_1} + \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial v_9}\frac{\partial v_9}{\partial v_7}\frac{\partial v_7}{\partial x_1}\\ &= \frac{\partial f}{\partial v_6}\frac{\partial v_6}{\partial v_5}\frac{\partial v_5}{\partial v_3}\frac{\partial v_3}{\partial v_1}\frac{\partial v_1}{\partial x_1} + \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial v_9}\frac{\partial v_9}{\partial v_7}\frac{\partial v_7}{\partial x_1}\\ \frac{\partial f}{\partial x_{2}} &= \frac{\partial f}{\partial v_6}\frac{\partial v_6}{\partial x_2} + \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial x_2}\\ &= \frac{\partial f}{\partial v_6}\frac{\partial v_6}{\partial v_5}\frac{\partial v_5}{\partial x_2} + \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial v_9}\frac{\partial v_9}{\partial x_2}\\ &= \frac{\partial f}{\partial v_6}\frac{\partial v_6}{\partial v_5}\frac{\partial v_5}{\partial v_4}\frac{\partial v_4}{\partial x_2} + \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial v_9}\frac{\partial v_9}{\partial v_8}\frac{\partial v_8}{\partial x_2}\\ &= \frac{\partial f}{\partial v_6}\frac{\partial v_6}{\partial v_5}\frac{\partial v_5}{\partial v_4}\frac{\partial v_4}{\partial v_2}\frac{\partial v_2}{\partial x_2} + \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial v_9}\frac{\partial v_9}{\partial v_8}\frac{\partial v_8}{\partial x_2} \end{align*}\]
Notice the shared factors, which can be exploited for computational and storage efficiency.
Methods based on automatic differentiation are commonly categorised into two groups: forward accumulation and reverse accumulation. As the names suggest, if you start with an input node (e.g., \(x_1\)) and compute each partial derivative by following the arrows, you are doing a forward accumulation method. For example, accumulate partial derivatives in the following order: \[ \frac{\partial v_7}{\partial x_1} \to \frac{\partial v_9}{\partial v_7} \to \frac{\partial v_{10}}{\partial v_9} \to \frac{\partial f}{\partial v_{10}} \] to get \[ \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial v_9}\frac{\partial v_9}{\partial v_7}\frac{\partial v_7}{\partial x_1}. \] In contrast, if you start with the output node (i.e., \(f\)) and compute each partial derivative by going against the arrows, you are doing a reverse accumulation method. For example, accumulate partial derivatives in the following order: \[ \frac{\partial f}{\partial v_{10}} \to \frac{\partial v_{10}}{\partial v_9} \to \frac{\partial v_9}{\partial v_7} \to \frac{\partial v_7}{\partial x_1} \] to get \[ \frac{\partial f}{\partial v_{10}}\frac{\partial v_{10}}{\partial v_9}\frac{\partial v_9}{\partial v_7}\frac{\partial v_7}{\partial x_1}. \]
It is dubbed “automatic” (I guess) because component derivatives in forward methods are computed simultaneously with the evaluation of the function itself. Forward accumulation methods are typically implemented using dual numbers, whose mathematical properties are just mind-blowing🤯 Forward methods are simpler but less suitable for functions on high-dimensional input spaces. In R, the dual package implements dual numbers.
For example, by evaluating \(f_2\) at \((x_1,x_2)=(1,2)\) using a dual number, you get not only \(f_2(1,2)\) but also \(\nabla f_2(1,2)\) automatically
library(dual)
x1 <- dual(f=1, grad=c(1,0))
x2 <- dual(f=2, grad=c(0,1))
f2(x1,x2)
## Real: -0.300215
## Duals: 1.297122 -3.311502
which is the same as the one obtained from the manually defined gradient
fgh2(c(1,2))[1:2]
## [[1]]
## [1] -0.3002153
##
## [[2]]
## [1] 1.297122 -3.311502
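The forward-mode bookkeeping can also be hand-rolled: represent each intermediate value as a pair (value, gradient) and let every primitive operation propagate both. The helper names below (d.var, d.mul, etc.) are my own; this is a sketch, not the dual package’s implementation.

```r
# Each "dual" is a list(f=value, g=gradient vector); primitives apply their own
# derivative rules and the chain rule falls out of the composition.
d.var <- function(v, k, i) { g <- rep(0, k); g[i] <- 1; list(f=v, g=g) }
d.sub <- function(a, b) list(f=a$f - b$f, g=a$g - b$g)
d.mul <- function(a, b) list(f=a$f*b$f, g=a$f*b$g + b$f*a$g)  # product rule
d.scale <- function(s, a) list(f=s*a$f, g=s*a$g)
d.sin <- function(a) list(f=sin(a$f), g=cos(a$f)*a$g)
d.cos <- function(a) list(f=cos(a$f), g=-sin(a$f)*a$g)
d.exp <- function(a) list(f=exp(a$f), g=exp(a$f)*a$g)
# f2(x1, x2) = sin(x1^2/2 - x2^2/4) * cos(2*x1 - exp(x2)), evaluated at (1, 2)
x1 <- d.var(1, 2, 1); x2 <- d.var(2, 2, 2)
u <- d.sub(d.scale(1/2, d.mul(x1, x1)), d.scale(1/4, d.mul(x2, x2)))
v <- d.sub(d.scale(2, x1), d.exp(x2))
f <- d.mul(d.sin(u), d.cos(v))
c(f$f, f$g)  # value and gradient, matching fgh2(c(1,2)) above
```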
Below, we cover a type of reverse method called backpropagation, as the foundational algorithm for training deep neural networks.
In statistics and machine learning, we often want to minimise the following objective function: \[ C: \mathbb{R}^d\to\mathbb{R};\; w \mapsto \sum_{j=1}^m(y_j-f(x_j;w))^2,\] where \(f\) is a statistical model parameterised by \(w\) and \(\{(x_j,y_j)\}_{j=1}^m\) is the data to which \(f\) is fit. It is a least-squares problem: \[\min_{w\in\mathbb{R}^d} C(w) = \min_{w\in\mathbb{R}^d} \sum_{j=1}^m(y_j-f(x_j;w))^2.\] To proceed with gradient descent,
\[ w_{n+1} = w_n - \alpha \nabla C(w_n),\] we need to compute the gradient \(\nabla C(w_n)\) every time we take a step \(\alpha \nabla C(w_n)\) from \(w_n\) to \(w_{n+1}\). Mathematically, this is simply the partial derivatives \(\frac{\partial C}{\partial w_i}(w_n)\) for \(i\in\{1,\dots,d\}\). As long as we have these partial derivative functions implemented in R for all \(i\), we might just call each of the \(d\) functions at every step \(n\in\{1,2,\dots,N\}\) and reach a local minimum after \(N\) steps. In practice, this simple idea can easily become infeasible.
First, the required number of steps \(N\) depends on the model \(f\), the step size \(\alpha\), and the quality of the initial point \(w_0\). Second, in modern optimisation problems, \(d\) tends to be large, and \(\nabla C\) comprises \(d\) complicated functions. Even a single evaluation of such a \(\nabla C\) may take non-trivial time. For example, in deep learning, it is common to have millions of parameters \((d>10^6)\) that specify a neural network model. Estimation of those parameters may take days or even weeks.
Let’s use a simple neural network to understand common challenges of computing the gradient and a workaround. In standard regression problems, a neural network is a real-valued function \(f:\mathbb{R}^k\to\mathbb{R}\) composed in a special way. The composition is typically illustrated using a network diagram as below.
In this figure, we have \(k=4\). Each arrow represents a scalar number passed between two nodes. Specifically, the output \(a_j^{(k)}\) of node \(j\) in layer \(L_k\) is multiplied by an associated weight \(w_{lj}^{(k)}\), and the product \(w_{lj}^{(k)}a_j^{(k)}\) is passed to node \(l\) in \(L_{k+1}\), for \(k\in\{1,\dots,K\}\). For example, \(w_{32}^{(1)}\) is a weight parameter for an output \(a_2^{(1)}\) from Green#2 passed to Purple#3.
Except for nodes in the input layer, every node represents a function that takes input values from connected nodes in the previous layer and outputs a scalar number. Specifically, an output from node \(l\) in \(L_{k+1}\) is computed as follows: \[\begin{align} z_l^{(k+1)} &= w_{l0}^{(k)} + \sum_{j=1}^{p_{k}}w_{lj}^{(k)}a_j^{(k)}\\ a_l^{(k+1)} &= g^{(k+1)}(z_l^{(k+1)}), \end{align}\] where \(p_k\) is the number of nodes in \(L_k\), \(w_{l0}^{(k)}\) is a bias term, and \(g^{(k+1)}\) is an activation function for layer \(L_{k+1}\).
The overall function \(f = a_1^{(K)}\) is a composition of these operations.
Even in such a small neural network, there are \((4\times 5 + 5) + (5\times 1 + 1) = 31\) parameters to optimise. You can imagine that, with a high-dimensional input \(x\), several layers, and many nodes in each layer, the number of parameters can be quite large.
Since \(f\) is a composition of explicit \(g\) and linear transformation, using the chain rule, we could in principle write down a partial derivative expression with respect to every single \(w_{lj}^{(k)}\) and separately call each derivative function when needed. But, this would be infeasible for complicated compositions. In the case of neural networks, we should exploit the hierarchical structure of \(f\) for efficient computation. Backpropagation is an optimisation technique that makes fitting complicated models computationally feasible.
When the minimisation problem is \[\min_{w\in\mathbb{R}^d} C(w) = \min_{w\in\mathbb{R}^d} \sum_{j=1}^m c_j(w),\] a partial derivative of \(C(w)\) is the sum of partial derivatives of \(c_j(w)\) for all \(j\). In the following derivation, therefore, we focus on a representative data point \((x,y)\) and the associated term \(c(w)\).
To compute \(c(w)\), first, we evaluate \(f\) at \(x\) using some initial \(w\). In the above figure, this computation flows from the left to the right, so-called forward propagation as it looks like moving input \(x\) forward to \(f(x)\). In addition to the final \(f(x)\) value, along the way, we store intermediary values \(a_l^{(k)}\) and \(z_l^{(k)}\) for all \(l\) and \(k\).
We begin with the last layer \(L_{K}\), where there is a single output node \(a_1^{(K)}\). A partial derivative of \(c(w)\) with respect to \(w_{1j}^{(K-1)}\), one of the weights connected to \(a_1^{(K)}\), is straightforward: \[\begin{align} \frac{\partial c}{\partial w_{1j}^{(K-1)}} &= \frac{\partial c}{\partial a_1^{(K)}} \frac{\partial a_1^{(K)}}{\partial z_1^{(K)}} \frac{\partial z_1^{(K)}}{\partial w_{1j}^{(K-1)}}\\ &= \left(\frac{\partial z_1^{(K)}}{\partial w_{1j}^{(K-1)}}\right) \left(\frac{\partial c}{\partial a_1^{(K)}} \frac{\partial a_1^{(K)}}{\partial z_1^{(K)}}\right)\\ &= a_j^{(K-1)} \delta_1^{(K)}. \end{align}\]
A quantity \(\delta_i^{(k)} := \frac{\partial c}{\partial a_i^{(k)}} \frac{\partial a_i^{(k)}}{\partial z_i^{(k)}} = \frac{\partial c}{\partial z_i^{(k)}}\) for \(i\in\{1,\dots,P_{k}\}\) is often called an error for node \(i\) in \(L_k\). Remember that, during the forward propagation, we have already computed and stored \(a_j^{(K-1)}\). So, here, we just need to compute \(\delta_1^{(K)}\).
With the following figure as a visual aid, try to compute a partial derivative of \(c(w)\) with respect to a weight \(w_{lj}^{(k-1)}\) for \(L_k\) \((k\in\{2,\dots,K-1\})\).
It looks complicated, but what is happening is simply applying the chain rule several times. We just need to keep track of indices, which is something computers are good at and humans are not…
\[\begin{align} \frac{\partial c}{\partial w_{lj}^{(k-1)}} &= \left(\frac{\partial c}{\partial a_l^{(k)}} \frac{\partial a_l^{(k)}}{\partial z_l^{(k)}}\right) \frac{\partial z_l^{(k)}}{\partial w_{lj}^{(k-1)}}\\ &= \left(\sum_{i=1}^{P_{k+1}} \frac{\partial c}{\partial z_i^{(k+1)}} \frac{\partial z_i^{(k+1)}}{\partial a_l^{(k)}} \frac{\partial a_l^{(k)}}{\partial z_l^{(k)}}\right) \frac{\partial z_l^{(k)}}{\partial w_{lj}^{(k-1)}}\\ &= \sum_{i=1}^{P_{k+1}} \delta_i^{(k+1)} \frac{\partial z_i^{(k+1)}}{\partial a_l^{(k)}} \frac{\partial a_l^{(k)}}{\partial z_l^{(k)}} \frac{\partial z_l^{(k)}}{\partial w_{lj}^{(k-1)}}\\ &= \sum_{i=1}^{P_{k+1}} \delta_i^{(k+1)} w_{il}^{(k)} \dot{g}^{(k)}(z_l^{(k)}) a_j^{(k-1)}\\ &= a_j^{(k-1)}\left(\sum_{i=1}^{P_{k+1}} w_{il}^{(k)} \delta_i^{(k+1)}\right) \dot{g}^{(k)}(z_l^{(k)})\\ &\equiv a_j^{(k-1)} \delta_l^{(k)} \end{align}\] where \(\dot{g}^{(k)}(z_l^{(k)}) = \frac{\partial a_l^{(k)}}{\partial z_l^{(k)}}(z_l^{(k)})\) is the derivative of \(g^{(k)}\) evaluated at \(z_l^{(k)}\), which we have already computed and stored during the forward propagation. Also notice that each \(\delta_i^{(k+1)}\) in the summand has already been computed when computing \(\frac{\partial c}{\partial w_{ij}^{(k)}}\).
To recap, the key factor is \[\sum_{i=1}^{P_{k+1}} w_{il}^{(k)} \delta_i^{(k+1)}.\] To compute a partial derivative of \(c(w)\) with respect to a weight \(w_{lj}^{(k-1)}\) for \(L_k\) \((k\in\{2,\dots,K-1\})\), we need information about how much this weight \(w_{lj}^{(k-1)}\) (or equivalently \(w_{lj}^{(k-1)}a_j^{(k-1)}\) as an input for node \(l\) in \(L_{k}\)) contributes to the objective function \(c(w)\). For \(k=K\), since there is a direct connection between \(w_{1j}^{(K-1)}\) and \(c(w)\), computation is simple. For \(k<K\), \(w_{lj}^{(k-1)}\) first contributes to an output \(a_l^{(k)}\) at node \(l\) in \(L_k\), which in turn contributes to multiple nodes in \(L_{k+1}\). This is the reason for the above summation — a weighted sum of the errors \(\delta_i^{(k+1)}\) at \(L_{k+1}\).
The technique is called backpropagation because of this information passing in the reverse direction, i.e., the sequential computation of \(\delta_l^{(k)}\) for \(k<K\) starting from \(\delta_1^{(K)}\) at the output layer. Despite the complicated \(f\), because many values are shared and reused across partial derivatives, the overall computation of the gradient is scalable and feasible even for very high-dimensional \(w\).
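To make the index bookkeeping concrete, here is a minimal forward-plus-backward sketch for one data point, using a tiny 2–4–1 network with sigmoid hidden units and a linear output; the sizes, activation, and random initial weights are my own illustrative choices. The last lines check one backpropagated entry against a finite difference.

```r
set.seed(1)
k <- 2; p <- 4                                   # input size, hidden nodes
W1 <- matrix(rnorm(p*k), p, k); b1 <- rnorm(p)   # weights/biases into the hidden layer
W2 <- matrix(rnorm(p), 1, p);   b2 <- rnorm(1)   # weights/bias into the output node
g    <- function(z) 1/(1 + exp(-z))              # sigmoid activation
gdot <- function(z) g(z)*(1 - g(z))              # its derivative
x <- c(0.5, -1); y <- 0.3                        # one data point (x, y)
# Forward propagation: store the z's and a's for reuse in the backward pass
z2 <- drop(b1 + W1 %*% x); a2 <- g(z2)
z3 <- drop(b2 + W2 %*% a2)                       # linear output: f(x; w) = z3
cst <- (y - z3)^2                                # c(w) for this data point
# Backward propagation: errors delta = dc/dz, passed layer by layer in reverse
d3 <- 2*(z3 - y)                                 # output error
d2 <- drop(t(W2))*d3 * gdot(z2)                  # hidden errors: weighted sum of d3
dW2 <- d3*a2; db2 <- d3                          # dc/dW2[1,j] = a2[j]*d3
dW1 <- outer(d2, x); db1 <- d2                   # dc/dW1[l,j] = x[j]*d2[l]
# Sanity check: compare one entry with a forward finite difference
eps <- 1e-6; W1e <- W1; W1e[2,1] <- W1e[2,1] + eps
fd <- ((y - drop(b2 + W2 %*% g(b1 + W1e %*% x)))^2 - cst)/eps
c(dW1[2,1], fd)  # nearly equal
```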